
Embracing Automated Retraining

How to move away from retraining at a set cadence (or not at all) in favor of a dynamic approach

Claire Longo
Towards Data Science
7 min read · Mar 17, 2023


This piece was co-authored by Trevor LaViale

While the industry has invested a lot in processes and techniques for knowing when to deploy a model into production, there is arguably less collective knowledge on the equally important task of knowing when to retrain a model. In truth, knowing when to retrain a model is hard due to factors like delays in feedback or labels for live predictions. In practice, many models run in production with no retraining at all, rely on manual retraining, or are retrained on a cadence that was never optimized or studied.

This post is written to help data scientists and machine learning engineering teams embrace automated retraining.

Approaches for Retraining

There are two core approaches to automated model retraining:

  • Fixed: Retraining at a set cadence (e.g., daily, weekly, monthly)
  • Dynamic: Retraining triggered ad hoc by model performance metrics

While the fixed approach is straightforward to implement, it has some drawbacks. Compute costs can be higher than necessary, frequent retraining can introduce inconsistencies from one model version to the next, and an infrequent retraining schedule can leave the model stale.

The dynamic approach can prevent models from going stale and can optimize compute costs. While there are numerous approaches to retraining, here are some recommended best practices for dynamic model retraining that will keep models healthy and performant.
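
To make the contrast concrete, here is a minimal sketch of the two approaches. It assumes the third-party `schedule` library for scheduling, and the retraining and metric functions are hypothetical placeholders for your own pipeline and monitoring calls.

```python
import random
import schedule  # third-party scheduling library (pip install schedule)

ACCURACY_THRESHOLD = 0.80  # assumed trigger threshold for the dynamic approach

def retrain_model() -> None:
    print("Kicking off retraining pipeline...")  # placeholder for a real training job

def get_latest_accuracy() -> float:
    return random.uniform(0.6, 0.95)  # placeholder for a query to your monitoring platform

# Fixed approach: retrain every day at 02:00, regardless of how the model is doing.
schedule.every().day.at("02:00").do(retrain_model)

# Dynamic approach: check a performance metric hourly and retrain only when it degrades.
def maybe_retrain() -> None:
    if get_latest_accuracy() < ACCURACY_THRESHOLD:
        retrain_model()

schedule.every().hour.do(maybe_retrain)

# A long-running worker would then loop over schedule.run_pending().
```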

Generalized Retraining Architecture

There is a suite of tools that can be used to create a model retraining system. This diagram shows how an ML observability platform might integrate into a generalized flow.

There is a wealth of tutorials for specific tooling. Here are a few:

For those ready to skip ahead, you can also take it a step further with Etsy’s take on stateful model retraining.

Retraining Strategy

Automating the retraining of a live machine learning model can be a complex task, but there are some best practices that can help guide the design.

Metrics To Trigger Retraining

The metrics used to trigger retraining will depend on the specific model and use case. Each metric needs a threshold; when the model’s performance crosses that threshold, retraining is triggered. This is where monitors come into play: when a performance monitor fires in a model monitoring platform, you can programmatically query the performance and drift metrics to evaluate whether retraining is needed.

Ideal metrics to trigger model retraining:

  • Prediction (score or label) drift
  • Performance metric degradation
  • Performance metric degradation for specific segments/cohorts
  • Feature drift
  • Embeddings drift

Drift is a measure of the distance between two distributions. It is a meaningful metric for triggering model retraining because it indicates how much your production data has shifted from a baseline. Statistical drift can be quantified with various metrics, such as the population stability index (PSI) used in the examples below.

The baseline dataset used to calculate drift can be derived from either the training dataset or a window of production data.
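
As an illustration, here is a minimal sketch of a drift-based trigger using PSI. The 0.25 PSI and 80% accuracy thresholds are placeholder assumptions, and in practice the baseline and production score samples would come from your data warehouse or observability platform.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population stability index between a baseline sample and a production sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)   # bins come from the baseline
    actual = np.clip(actual, edges[0], edges[-1])         # keep production values in range
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    expected_pct = np.clip(expected_pct, 1e-6, None)      # avoid log(0) and divide-by-zero
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

def should_retrain(baseline_scores, production_scores, accuracy,
                   psi_threshold: float = 0.25, accuracy_floor: float = 0.80) -> bool:
    """Trigger retraining on prediction-score drift or on performance degradation."""
    return psi(baseline_scores, production_scores) > psi_threshold or accuracy < accuracy_floor

# Toy example: baseline scores from training vs. a recent production window.
rng = np.random.default_rng(0)
baseline = rng.normal(0.60, 0.10, 10_000)
production = rng.normal(0.45, 0.15, 5_000)
print(should_retrain(baseline, production, accuracy=0.83))
```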

Ensuring the New Model Is Working

The new model will need to be tested or validated before promoting it to production to replace the old one. There are a few recommended approaches here:

  • Human review
  • Automated metric checks in the CI/CD pipeline (a sketch follows below)
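
For the automated option, a minimal sketch of a CI/CD gate might look like the following. The model paths, holdout file, and accuracy tolerance are assumptions for illustration.

```python
import sys
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

TOLERANCE = 0.01  # the candidate may not be more than one point worse than the champion

def evaluate(model_path: str, X: pd.DataFrame, y: pd.Series) -> float:
    """Load a serialized model and score it on a held-out dataset."""
    model = joblib.load(model_path)
    return accuracy_score(y, model.predict(X))

if __name__ == "__main__":
    holdout = pd.read_parquet("holdout.parquet")  # hypothetical held-out evaluation set
    X, y = holdout.drop(columns=["label"]), holdout["label"]
    champion_acc = evaluate("models/champion.joblib", X, y)
    candidate_acc = evaluate("models/candidate.joblib", X, y)
    print(f"champion={champion_acc:.3f} candidate={candidate_acc:.3f}")
    # A non-zero exit code fails the pipeline and blocks promotion of the new model.
    if candidate_acc < champion_acc - TOLERANCE:
        sys.exit(1)
```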

Strategy for Promoting the New Model

The strategy for promoting the new model will depend on the impact that the model has on the business. In some cases, it may be appropriate to automatically replace the old model with the new one. In other cases, the new model may need to be A/B tested live before replacing the old model.

Some strategies for live model testing to consider are:

  • Champion vs. Challenger — serve production traffic to both models but only use the prediction/response from the existing model (champion) in the application. The data from the challenger model is stored for analysis but not used.
  • A/B testing — split production traffic to both models for a fixed experimentation period. Compare key metrics at the end of the experiment and decide which model to promote.
  • Canary deployment — start by redirecting a small percentage of production traffic to the new model. Since it’s in a production path, this helps to catch real issues with the new model but limits the impact to a small percentage of users. Gradually ramp up traffic until the new model receives 100% of it (see the sketch after this list).
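
As a concrete illustration of the canary approach, here is a minimal routing sketch. The 5% fraction and the toy models are assumptions; in practice this logic usually lives in the serving layer or behind a feature flag.

```python
import random

CANARY_FRACTION = 0.05  # start by routing 5% of traffic to the new model

def route(champion_predict, canary_predict, features):
    """Return (prediction, model_tag); the tag lets live metrics be compared per model."""
    if random.random() < CANARY_FRACTION:
        return canary_predict(features), "canary"
    return champion_predict(features), "champion"

# Toy stand-ins for the two deployed models.
champion = lambda x: int(sum(x) > 1.0)
canary = lambda x: int(sum(x) > 0.9)

for _ in range(5):
    print(route(champion, canary, [0.4, 0.7]))
```

Ramping up is then simply a matter of increasing the routing fraction as the new model proves itself.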

Retraining Feedback Loop Data

Once we identify that the model needs to be retrained, the next step is to choose the right dataset to retrain with. Here are some recommendations to ensure the new training data will improve the model’s performance:

  • If the model performs well overall but is failing to meet optimal performance criteria on some segments, such as specific feature values or demographics, the new training dataset should contain extra data points from those lower-performing segments. A simple upsampling strategy can be used to build such a dataset (see the sketch after this list).
  • If the model is trained on a small timeslice, the training dataset may not accurately capture and represent all possible patterns that will appear in the live production data. To prevent this, avoid training the model on recent data alone. Instead, use a large sample of historical data, and augment this with the latest data to add additional patterns for the model to learn from.
  • If your model architecture follows the transfer learning design, new data can simply be added to the model during retraining, without losing the patterns that the model has already learned from previous training data.
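
Here is a minimal sketch of the upsampling idea from the first bullet. The dataframe, segment definition, and sampling factor are assumed for illustration.

```python
import pandas as pd

def upsample_segment(df: pd.DataFrame, mask: pd.Series, factor: int = 3,
                     seed: int = 42) -> pd.DataFrame:
    """Duplicate rows from an underperforming segment so it is better represented."""
    segment = df[mask]
    extra = segment.sample(n=len(segment) * (factor - 1), replace=True, random_state=seed)
    return pd.concat([df, extra], ignore_index=True)

# Toy example: the model underperforms for a hypothetical low-tenure cohort.
train = pd.DataFrame({"tenure_days": [3, 400, 5, 900, 250], "label": [0, 1, 1, 0, 1]})
new_training_set = upsample_segment(train, mask=train["tenure_days"] < 30)
print(new_training_set["tenure_days"].value_counts())
```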

Dashboards from a model monitoring platform (e.g., Arize; full disclosure: I work for Arize) are great for tracking and comparing live model performance during these tests. Whether the model is tested as a shadow deploy, a live A/B test, or simply an offline comparison, these dashboards offer a simple way to view a side-by-side model comparison. The dashboards can also easily be shared with others to demonstrate model performance improvements to stakeholders.

Quantifying ROI

Overall, it’s important to have a clear understanding of your business requirements and the problem you are trying to solve when determining the best approach for automating the retraining of a live machine learning model. It’s also important to continuously monitor the performance of the model and make adjustments to the retraining cadence and metrics as needed.

Measuring Cost Impact

Although it is challenging to calculate direct ROI for some tasks in AI, the value of optimized model retraining is simple, tangible, and possible to calculate directly. The compute and storage costs for model training jobs are often already tracked as part of cloud compute costs. Often, the business impact of a model can be calculated as well.

When optimizing retraining, we are considering both the retraining costs and the impact of model performance on the business (“AI ROI”). We can weigh these against each other to justify the cost of model retraining.

Here, we propose a weekly cost calculation, although it can be adapted to a different cadence such as daily or monthly, depending on the model’s purpose and maintenance needs.
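
The calculation itself is simple; here is a minimal sketch, using the same numbers as Scenario 1 below. Real values would come from your cloud billing data and business metrics.

```python
def weekly_maintenance_cost(cost_per_retrain: float, retrains_per_week: int) -> float:
    """Weekly retraining cost: cost per training job times number of jobs per week."""
    return cost_per_retrain * retrains_per_week

def savings_pct(old_cost: float, new_cost: float) -> float:
    return 100 * (old_cost - new_cost) / old_cost

# Scenario 1: daily retraining replaced by roughly two metric-triggered retrains per week.
old = weekly_maintenance_cost(200, 7)  # $1,400
new = weekly_maintenance_cost(200, 2)  # $400
print(f"old=${old:,.0f}  new=${new:,.0f}  savings={savings_pct(old, new):.0f}%")  # ~71%
```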


Consider scenario 1, a case where the model is retrained too frequently.

My model costs $200 to retrain. I train my model once per day. This model maintained a steady average weekly accuracy of 85%. I set up a pipeline to automatically retrain based on prediction score drift greater than 0.25 PSI and an accuracy threshold. Based on the new rule, my model retrains only twice a week and maintains the same 85% accuracy.

Comparison of weekly maintenance costs:

Old model maintenance cost: 7*$200 = $1400

New model maintenance cost: 2*$200 = $400

That’s a roughly 71% reduction in model maintenance costs. Although this is a simple, contrived example, the magnitude of cost savings can be on this scale.

Consider scenario 2, a case where the model is not retrained often enough.

My model costs $200 to train. I train my model once per week. This model maintained a steady average weekly accuracy of 65%. I set up a pipeline to automatically retrain based on prediction score drift greater than 0.25 PSI. Based on the new rule, my model retrains twice a week and achieves a better accuracy of 85%.

Comparison of weekly maintenance costs:

Old model maintenance cost: 1*$200 = $200 for 65% accuracy

New model maintenance cost: 2*$200 = $400 for 85% accuracy

So for a higher price, better model performance has been achieved. This can be justified and profitable if the AI ROI values are higher than the retraining costs. Lack of frequent retraining could have been leaving money on the table.

Conclusion

Transitioning from model retraining at fixed intervals to automated model retraining triggered by model performance offers numerous benefits for organizations, from lower compute costs at a time when cloud costs are increasing to better AI ROI from improved model performance. Hopefully this blog provides a template for teams to take action.


